ABSTRACT

Modeling student knowledge is a fundamental task of an in- 
telligent tutoring system. A popular approach for modeling 
the acquisition of knowledge is Bayesian Knowledge Trac- 
ing (BKT). Various extensions to the original BKT model 
have been proposed, among them two novel models that 
unify BKT and Item Response Theory (IRT). Latent Factor
Knowledge Tracing (LFKT) and Feature Aware Student
Knowledge Tracing (FAST) exhibit state-of-the-art prediction
accuracy. However, only a few studies have analyzed the
characteristics of these different models. In this paper, we
therefore evaluate and compare properties of the models us- 
ing synthetic data sets. We sample from a combined stu- 
dent model that encompasses all four models. Based on the 
true parameters of the data generating process, we assess 
model performance characteristics for over 66’000 parame- 
ter configurations and identify best and worst case perfor- 
mance. Using regression we analyze the influence of different 
sampling parameters on the performance of the models and 
study their robustness under different model assumption vi- 
olations. 
1. INTRODUCTION

A fundamental part of an intelligent tutoring system (ITS) 
is the student model. Task selection and evaluation of the 
student’s learning progress are based on this model, and 
therefore it influences the learning experience and the learn- 
ing outcome of a student. Thus, accurately modeling and 
predicting student knowledge is essential. 


Approaches for student modeling are usually based on two 
popular techniques: Item Response Theory (IRT) [36] and 
Bayesian Knowledge Tracing (BKT) [9]. The concept of 
IRT assumes that the probability of a correct response
to an item is a mathematical function of student and item 
parameters. The Additive Factors Model (AFM) [7, 8] fits 
a learning curve to the data by applying a logistic regression.
Another technique called Performance Factors Analysis
(PFA) [27] is based on the Rasch item response model [12].
BKT models student knowledge as a binary variable that can 
be inferred by binary observations. Performance of the orig- 
inal BKT model has been improved by using individualiza- 
tion techniques such as modeling the parameters by student 
and skill [23, 35, 39] or per school class [34]. Clustering ap- 
proaches [25] have also proven successful in improving the 
prediction accuracy of BKT. Furthermore, hybrid models 
combining the approaches of IRT and BKT have been pro- 
posed. In [17] a dynamic mixture model has been presented 
to trace performance and affect simultaneously. The KT- 
IDEM model extends BKT by introducing item difficulty 
parameters [22]. Other work focused on individualizing the 
initial mastery probability of BKT by using IRT [38]. Logistic
regression has also been used to integrate subskills into
BKT [37]. Recently, two models have been introduced which
synthesize IRT and BKT. Latent Factor Knowledge Tracing 
(LFKT) [18] individualizes the guess and slip probabilities 
of BKT based on student ability and item difficulty. Feature 
Aware Student Knowledge Tracing (FAST) [14] generalizes 
the individualized guess and slip probabilities to arbitrary 
features. 

Lately, the analysis of properties of BKT has gained increas- 
ing attention. It has been shown [5] that learning BKT 
models exhibits fundamental identifiability problems, i.e., 
different model parameter estimates may lead to identical 
predictions about student performance. This problem was 
addressed by using an approach that biases the model search 
by Dirichlet priors to get statistically reliable improvements 
in predictive performance. [33] extended this work by per- 
forming a fixed point analysis of the solutions of the BKT 
learning task and by deriving constraints on the range of 
parameters that lead to unique solutions. Furthermore, it 
has been shown that the parameter space of BKT models 
can be reduced using clustering [30]. Other research focused
on analyzing convergence properties [24] of the expectation 
maximization algorithm (EM) for learning BKT models and 
exploring parameter estimates produced by EM [15]. It has 
been shown that convergence in the log likelihood space does 
not necessarily mean convergence in the parameter space. 
[11] have studied how good BKT is at predicting the moment 
of mastery. Different thresholds to assess mastery and their 
corresponding lag, i.e., the number of tasks that BKT needs 
to assess mastery (after mastery has already been achieved), 
have been investigated. Using multiple model fitting pro- 
cedures, BKT has been compared to PFA [13]. While no 
differences in predictive accuracy between the models have 
been reported, it has been shown that for knowledge tracing 
EM achieves significantly higher predictive accuracy than 
Brute Force. Findings from other studies, however, suggest 
the opposite [1, 2]. In [4], upper bounds on the predictive
performance have been investigated by employing various 
cheating models. It has been concluded that BKT and PFA 
perform close to these limits, suggesting that other factors 
such as robust learning or optimal waiting intervals should 
be considered to improve tutorial decision making. The pre- 
dictive performance of LFKT and FAST has been compared 
to KT and IRT models in [19]. The evaluation is based on 
data from different intelligent tutoring systems. 

In this work, we are interested in the properties of hybrid 
approaches combining latent factor and knowledge tracing 
models. Extending previous work, especially [19],
we empirically evaluate the performance characteristics of 
the two recent hybrid models LFKT and FAST on synthetic 
data and compare them to the underlying approaches of 
BKT and IRT. We sample from a combined student model 
that encompasses all four models. By using synthetic data 
generated from the combined model, we show the robustness
of the models when model assumptions are violated. By
evaluating the models on 66’000 different parameter configu- 
rations we are able to rigorously explore the parameter space 
to demonstrate the relative performance gain between mod- 
els for various regions of the parameter space. Our findings 
show that for the generated data sets FAST significantly out- 
performs all other methods for predicting the task outcome 
and that BKT is significantly better than FAST and LFKT 
at predicting the latent knowledge state. Furthermore, we
are able to identify the influence of different properties of a 
data set on model performance using regression and show 
best and worst case performances of the models. 

2. INVESTIGATED MODELS 

In an intelligent tutoring system a student is typically pre- 
sented with a set of tasks to learn a specific skill. For each 
student n the system chooses at time t an item i from a set 
of items corresponding to a particular skill. The system then 
observes the answer $y_{n,t}$ of the student, which is assumed to
be binary in this work. In the following, we briefly present 
four common techniques to model various latent states of 
the student and the tutoring environment. 

BKT. Bayesian Knowledge Tracing (BKT) [9] models the 
knowledge acquisition of a single skill and is a special case 
of a Hidden Markov Model (HMM) [29]. BKT uses two 
latent states (known and unknown) to model whether a student
$n$ has mastered a particular skill, $k_{n,t}$, at time $t$, and two
observable states (correct and incorrect) to represent the
outcome of a particular task. Therefore, the probabilistic
model can be fully described by a set of five probabilities.
The a-priori probability of knowing a skill, $p(k_{n,0})$, is
denoted by $p_I$. The transition from one knowledge state
$k_{n,t-1}$ to the next state $k_{n,t}$ is described by the probability
$p_L$ of transitioning from the unknown latent state to the
known state and the probability $p_F$ of transitioning from
the known to the unknown state:

$$p(k_{n,t}) = k_{n,t-1}(1 - p_F) + (1 - k_{n,t-1})\,p_L. \tag{1}$$

In the case of BKT, $p_F$ is fixed at 0. Finally, the task
outcomes $y_{n,t}$ are modeled as

$$p(y_{n,t}) = k_{n,t}(1 - p_S) + (1 - k_{n,t})\,p_G, \tag{2}$$

where $p_S$ denotes the slip probability, which is the probability
of solving a task incorrectly despite knowing the skill, and
$p_G$ is the guess probability, which is the probability of
correctly answering a task without having mastered the skill.
Learning the parameters for a BKT model is done using 
maximum likelihood estimation (MLE). 
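As an illustration of equations (1) and (2), the following minimal Python sketch traces the probability of mastery over a response sequence. Since the true state is latent, it replaces the binary state by its probability and conditions on each observation with the standard HMM filtering step; that step and all function and argument names are assumptions of this sketch, not the implementation used in this work.

```python
def bkt_predict_correct(p_known, p_slip, p_guess):
    """Equation (2), with the binary state replaced by its probability."""
    return p_known * (1.0 - p_slip) + (1.0 - p_known) * p_guess


def bkt_update(p_known, correct, p_learn, p_slip, p_guess, p_forget=0.0):
    """Condition on one observation, then apply the transition of equation (1).

    Assumes the observation has non-zero probability under the model.
    """
    if correct:
        evidence = p_known * (1.0 - p_slip)
        total = evidence + (1.0 - p_known) * p_guess
    else:
        evidence = p_known * p_slip
        total = evidence + (1.0 - p_known) * (1.0 - p_guess)
    posterior = evidence / total
    # Equation (1); p_forget = 0 corresponds to standard BKT.
    return posterior * (1.0 - p_forget) + (1.0 - posterior) * p_learn


def bkt_trace(responses, p_init, p_learn, p_slip, p_guess):
    """Return the predicted probability of a correct answer before each task
    and the final probability of mastery."""
    p_known = p_init
    predictions = []
    for correct in responses:
        predictions.append(bkt_predict_correct(p_known, p_slip, p_guess))
        p_known = bkt_update(p_known, correct, p_learn, p_slip, p_guess)
    return predictions, p_known
```

For example, `bkt_trace([0, 1, 1], 0.3, 0.2, 0.1, 0.2)` yields one prediction per task together with the mastery estimate after the last response.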

IRT. Item Response Theory (IRT) [36] models the response 
of a student to an item as a function of latent student
abilities $\theta_n$ and latent item difficulties $d_i$. The simplest form of
an IRT model is the Rasch model, where each student $n$ and
each item $i$ are treated independently. The outcome $y_{n,t}$ at
time $t$ is modeled using the logistic function

$$p(y_{n,t}) = \left(1 + e^{-(\theta_n - d_i)}\right)^{-1}. \tag{3}$$

A student with an ability of $\theta_n = d_i$ has a 50% chance of
getting item i correct. In contrast to BKT, IRT does not 
model knowledge acquisition. The model parameters for the 
Rasch model are learned using EM. 
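The Rasch response function can be written down directly. The short Python sketch below only evaluates the logistic function of the difference between ability and difficulty; parameter estimation is not shown, and the function name is an assumption of this sketch.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response for ability theta and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))
```

An ability equal to the item difficulty gives a probability of 0.5, and the probability grows as the ability exceeds the difficulty.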

LFKT. The Latent Factor Knowledge Tracing (LFKT) [18] 
model combines BKT and IRT using a hierarchical Bayesian 
model. On the basis of the BKT model, slip and guess prob- 
abilities are individualized based on student ability and item 
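difficulty as

$$p_G = \left(1 + e^{-(d_i - \theta_n + \gamma_G)}\right)^{-1}, \tag{4}$$

$$p_S = \left(1 + e^{-(\theta_n - d_i + \gamma_S)}\right)^{-1}, \tag{5}$$

where $\gamma_G$ and $\gamma_S$ are offsets for the guess and slip
probabilities. The model is fit by calculating Bayesian parameter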
posteriors using Markov Chain Monte Carlo. 
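A minimal Python sketch of the individualized guess and slip probabilities follows. It transcribes the signs of equations (4) and (5) as printed and plugs the results into the BKT emission; the function names are assumptions of this sketch, and the MCMC fitting procedure is not shown.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def lfkt_guess(theta, difficulty, gamma_g):
    """Equation (4) as printed: exponent -(d_i - theta_n + gamma_G)."""
    return logistic(difficulty - theta + gamma_g)


def lfkt_slip(theta, difficulty, gamma_s):
    """Equation (5) as printed: exponent -(theta_n - d_i + gamma_S)."""
    return logistic(theta - difficulty + gamma_s)


def lfkt_emission(known, theta, difficulty, gamma_g, gamma_s):
    """Probability of a correct answer given the binary knowledge state,
    using the BKT emission with individualized guess and slip."""
    if known:
        return 1.0 - lfkt_slip(theta, difficulty, gamma_s)
    return lfkt_guess(theta, difficulty, gamma_g)
```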

FAST. Feature Aware Student Knowledge Tracing (FAST) 
[14] allows for unification of BKT and IRT as well, but
generalizes the individualized slip and guess probabilities to
arbitrary features. Given a vector of features $f_{n,t}$ for a student
$n$ at time $t$, the adapted emission probability reads as

$$p(y_{n,t}) = \left(1 + e^{-(\omega^T f_{n,t})}\right)^{-1}, \tag{6}$$

where $\omega$ is a vector of learned feature weights. If a set of
binary indicator functions for the items and the students are
used, FAST is able to represent the item difficulties $d_i$ and
student abilities $\theta_n$ from the IRT model. The parameters
are fit using a variant of EM [6].
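To make the remark about indicator features concrete, the following Python sketch evaluates the logistic emission for a feature vector that contains one indicator per student and one per item. The feature layout and all names are assumptions of this sketch; it does not reproduce the FAST toolkit or its training procedure.

```python
import math


def fast_emission(weights, features):
    """Logistic function of the weighted feature sum."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def irt_indicator_features(student, item, n_students, n_items):
    """One indicator per student followed by one indicator per item."""
    features = [0.0] * (n_students + n_items)
    features[student] = 1.0
    features[n_students + item] = 1.0
    return features


def irt_weights(abilities, difficulties):
    """Weights under which the score equals ability minus difficulty."""
    return list(abilities) + [-d for d in difficulties]
```

With these weights, the score of student n on item i is the ability minus the difficulty, so the emission coincides with the Rasch response function.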
3. SYNTHETIC DATA GENERATION

Synthetic data is needed to have ground truth about the 
underlying data generating model, which enables the exper- 
imental evaluation of various properties of a model. 

The sampling procedure starts by generating $N$ student
abilities $\theta_n$ from a normal distribution $\mathcal{N}(0, \sigma)$. Then, it
generates $I$ item difficulties $d_i$ from a uniform distribution
$\mathcal{U}(-\delta, \delta)$. Based on the initial probability $p_I$ and the learn
probability $p_L$, a sequence of knowledge states
$k_{n,0}, k_{n,1}, \ldots, k_{n,T}$ is sampled according to (1); we therefore
simulate data from only one skill. The time $t^*$ at which
$k_{n,t^*} = 1$ for the first time is considered the moment of mastery.
The number of sampled knowledge states is then given as $T = t^* + L$,
where $L$ denotes the lag of the simulated mastery learning
system. For each student we generate a random sequence of
items, i.e., item indices $i$. Arbitrary features from the training
environment, such as answer times, help calls, problem
solving strategy, engagement state of the student and gaming
attempts, can have an influence on the performance of a
student. To simulate those influences in a principled way, a
single feature $f$ is added to the data generating model with
a varying feature weight $\omega$ (and thus varying correlation to
the task outcomes).

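Based on these quantities, we sample the observations $y_{n,t}$
from a Bernoulli distribution with probability

$$p(y_{n,t}) = \left(1 + e^{-(\theta_n - d_i + \omega f_{n,t} + \gamma_{n,t})}\right)^{-1}, \tag{7}$$

where

$$\gamma_{n,t} = -\log\left(\left(k_{n,t}(1 - p_S) + (1 - k_{n,t})\,p_G\right)^{-1} - 1\right).$$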

Figure 1 gives a graphical overview of the described sam- 
pling procedure. Our sampling model has the following 
nine parameters: $p_I, p_L, p_S, p_G, \delta, \sigma, \omega, I, N$. The described
sampling procedure allows sampling of data that exactly
matches the model assumptions of all four models. To sample
BKT data we set $\delta = \sigma = \omega = 0$ and (7) simplifies to
the standard BKT formulation. By setting $p_S = p_G = 0.5$
and $\omega = 0$ we can sample from an IRT model. To sample
from an LFKT model we set $\omega = 0$, and for FAST none of
the parameters are restricted. 
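The sampling procedure of this section can be sketched in Python as follows. The sketch assumes that the offset added inside the logistic function is the log-odds of the BKT emission probability, which is the choice under which the model reduces to BKT when the difficulty range, ability spread and feature weight are all zero. The function names, the clamping constant `eps`, the cap `max_tasks` and the exact handling of the last time step are assumptions of this sketch.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def gamma_offset(known, p_slip, p_guess, eps=1e-9):
    """Log-odds of the BKT emission probability for the current state."""
    q = (1.0 - p_slip) if known else p_guess
    q = min(max(q, eps), 1.0 - eps)  # avoid log(0) for degenerate settings
    return -math.log(1.0 / q - 1.0)


def sample_difficulties(rng, n_items, delta):
    return [rng.uniform(-delta, delta) for _ in range(n_items)]


def sample_student(rng, difficulties, p_init, p_learn, p_slip, p_guess,
                   sigma, omega, lag, max_tasks=1000):
    """Return the ability and rows (t, item, feature, known, observation)."""
    theta = rng.gauss(0.0, sigma)
    known = rng.random() < p_init
    t_star = None  # moment of mastery
    rows = []
    for t in range(max_tasks):
        if known and t_star is None:
            t_star = t
        item = rng.randrange(len(difficulties))
        feature = rng.uniform(-1.0, 1.0)
        p = logistic(theta - difficulties[item] + omega * feature
                     + gamma_offset(known, p_slip, p_guess))
        rows.append((t, item, feature, int(known), int(rng.random() < p)))
        if t_star is not None and t == t_star + lag:
            break  # stop `lag` tasks after the moment of mastery
        if not known:
            known = rng.random() < p_learn  # no forgetting
    return theta, rows
```

Calling `sample_student(random.Random(1), sample_difficulties(random.Random(2), 15, 1.5), 0.3, 0.2, 0.1, 0.2, 1.0, 0.5, 4)` produces one simulated student for a configuration in which all effects are active.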

4. EXPERIMENTAL SETUP 

Parameter space. We generated a vast number of pa- 
rameter configurations in order to analyze the four models. 
The set of parameter configurations has been carefully de- 
signed to match real world conditions. The BKT parameters 
($p_I, p_G, p_S, p_L$) are based on the parameter clusters found
on real world data [30]. Using a normal distribution with 
a standard deviation of 0.02, we sampled up to 30 points 
(depending on the cluster size) around each cluster mean. 
According to common practice [16] we scaled the student 
abilities $\theta_n$ to have a mean of 0 and a variance of 1, and
therefore $\sigma = 1$. We sampled the parameter $\delta$ (determining
the range of the item difficulties) uniformly from [0, 3] (ac- 
cording to [16]). Despite simulating only one skill, we varied 
the item difficulties to account for the fact that skill models 
tend to be imperfect in practice [7, 32, 20]. In accordance
with the item difficulties, the feature weight $\omega$ was varied
uniformly across [0, 1.5]. Feature values $f_{n,t}$ were sampled from
the uniform distribution $\mathcal{U}(-1, 1)$.



Figure 1: Combined student model used for synthetic data
generation. The model corresponds to LFKT with the addition
of a single feature. The relative dependencies of the observable
nodes (blue) and the latent nodes (white) are shown. $k_{n,t}$
denotes the latent knowledge state, $d_i$ the item difficulty,
$\theta_n$ the student ability, $y_{n,t}$ the observation, and $f_{n,t}$ the
feature value.

For every parameter configuration we generated five folds 
with N = 300 simulated students. Each fold was randomly 
split up into two parts of equal number of students. The 
first part was used as training data and the second part 
for testing. Therefore, the test data contained unseen
students only. As we simulated data from a mastery learning
environment the number of tasks simulated for each student 
was determined by the moment of mastery. Based on the 
results presented by [11], we set the lag of the simulated 
system to L = 4 tasks from the moment of mastery. We 
simulated I = 15 different items with random item order. 

In total, we generated 66'000 parameter configurations for
$p_I, p_G, p_S, p_L, \delta, \omega$. This amounts to a total evaluation time
(training and test) of 1’280 hours and 1’351 hours for LFKT and
FAST, respectively. The evaluation time for BKT was 99
minutes, and all configurations were evaluated in 58 minutes
for the IRT model.

Implementation. To train BKT models we used our custom
code, which fits BKT using the Nelder-Mead simplex
algorithm to minimize the negative log-likelihood. We thoroughly
tested our implementation against the BKT implementation of [39].
The IRT models were fit by joint maximum likelihood
estimation [21] implemented in the psychometrics library (footnote 1).
FAST using IRT features was shown to be equivalent to 
LFKT except for the parameter estimation procedure [19]. 
As this work did not investigate different parameter esti- 
mation techniques, both models were trained and evaluated 
using the publicly available FAST student modeling toolkit (footnote 2).
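As an illustration of the objective that a simplex search over the BKT parameters would minimize, the following Python sketch computes the negative log-likelihood of a set of response sequences with the HMM forward recursion. It is not the custom code used in this work; the parameter ordering and the handling of out-of-range parameters are assumptions of this sketch.

```python
import math


def bkt_negative_log_likelihood(params, sequences):
    """params = (p_init, p_learn, p_slip, p_guess); sequences = lists of 0/1."""
    p_init, p_learn, p_slip, p_guess = params
    if not all(0.0 < p < 1.0 for p in params):
        return float("inf")  # keeps an unconstrained optimizer inside (0, 1)
    nll = 0.0
    for responses in sequences:
        p_known = p_init
        for correct in responses:
            p_correct = p_known * (1.0 - p_slip) + (1.0 - p_known) * p_guess
            nll -= math.log(p_correct if correct else 1.0 - p_correct)
            # Posterior of the knowledge state given the observation.
            if correct:
                posterior = p_known * (1.0 - p_slip) / p_correct
            else:
                posterior = p_known * p_slip / (1.0 - p_correct)
            # Transition without forgetting.
            p_known = posterior + (1.0 - posterior) * p_learn
    return nll

# One possible way to minimize this objective, if SciPy is available, is
# scipy.optimize.minimize(bkt_negative_log_likelihood, x0, args=(sequences,),
#                         method="Nelder-Mead").
```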

5. RESULTS AND DISCUSSION 

Using the generated data, we investigated the performance
characteristics of the four models and evaluated their
predictive power and robustness under varying parameter
configurations. For our results we generated 66'000 parameter
configurations, and for each of them we generated synthetic
data for 1'500 students. Note that there are many ways to
characterize performance differences among student models
and we only cover a subset of these possibilities.

1 An open source Java library for measurement, available at
https://github.com/meyerjp3/psychometrics.

2 http://ml-smores.github.io/fast/

5.1 Error Metrics 

The right choice of error metrics when evaluating student 
models has recently gained increased interest in the EDM 
community. In [28] some of the common error metric choices 
are discussed, highlighting possible issues with the accuracy 
and area under the ROC curve (AUC) measure. Correlations
between various performance metrics and the accuracy
of predicting the moment of mastering a skill have been
investigated in [26], showing that the F-measure (equal to the
harmonic mean of precision and recall) and the recall are two 
metrics with a high correlation to the accuracy of knowledge 
estimation. The root mean squared error (RMSE) and log- 
likelihood, on the other hand, are well suited if one wants 
to recover the true learning parameters. Similarly, [10] con- 
cluded from results of 26 synthetic data sets that RMSE is 
better at fitting parameters than the log-likelihood. 

In line with this previous work we investigated correlations 
between accuracy, RMSE and F-measure across all four mod- 
els. For this, all models were trained and evaluated on data 
using 66'000 different parameter configurations. All metrics
are strongly correlated ($|\rho| > 0.75$, $p < 0.001$). Our
inspection of the metric correlations revealed no significant
differences in the metric correlations among the different models.
Thus, to a large extent the measures capture equal char- 
acteristics for the models we considered in this work. In 
the following, we therefore focus our analysis on the RMSE 
measure. 
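For reference, the three metrics compared in this section can be computed from binary outcomes and predicted probabilities as in the following Python sketch. The decision threshold of 0.5 used for accuracy and F-measure is an assumption of this sketch.

```python
import math


def rmse(y_true, y_prob):
    """Root mean squared error between outcomes and predicted probabilities."""
    squared = [(y - p) ** 2 for y, p in zip(y_true, y_prob)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(y_true, y_prob, threshold=0.5):
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def f_measure(y_true, y_prob, threshold=0.5):
    """Harmonic mean of precision and recall for the positive class."""
    predicted = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for y, z in zip(y_true, predicted) if y == 1 and z == 1)
    fp = sum(1 for y, z in zip(y_true, predicted) if y == 0 and z == 1)
    fn = sum(1 for y, z in zip(y_true, predicted) if y == 1 and z == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```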

5.2 Model Comparison 

Overall Performance. In a first step we investigated the 
overall performance of the models. For every parameter con- 
figuration, we calculated the average RMSE over the five 
generated folds. Table 1 summarizes the parameters for the 
best and worst data set for every model when model assump- 
tions are met (see Section 3). Results show that all models 
that model a knowledge state (all except IRT) perform best 
if the slip probability is low and the guess probability is 
high. This leads to a data set that exhibits a high ratio of 
correct observations. IRT performs best on data that has
clearly distinct item difficulties ($\delta$ is high). Notably, the
best performance of FAST is achieved on a data set without
features ($\omega = 0$). We assume that this is due to the
decreased complexity of the data set, compared to one that
exhibits high $\omega$. Consistently, worst case data sets exhibit
high symmetric values for guess and slip probabilities. In
the case of LFKT and FAST, worst case data sets additionally
do not distinguish between items (difficulty range $\delta = 0$),
and for FAST the feature weights are low.

Table 1: Parameters of best and worst case data sets
for each model. We only considered data sets that
meet the model assumptions. Parameters denoted
with * are fixed according to the model assumptions
(see Section 3).

Model  Case    $\delta$  $p_I$  $p_L$  $p_S$   $p_G$   $\omega$  RMSE
BKT    Best    0.00*     0.71   0.41   0.01    0.47    0.00*     0.25
BKT    Worst   0.00*     0.10   0.12   0.50    0.49    0.00*     0.48
IRT    Best    3.00      0.10   0.08   0.50*   0.50*   0.00*     0.42
IRT    Worst   0.00      0.10   0.10   0.50*   0.50*   0.00*     0.50
LFKT   Best    0.75      0.69   0.40   0.01    0.46    0.00*     0.25
LFKT   Worst   0.00      0.53   0.16   0.28    0.29    0.00*     0.51
FAST   Best    0.75      0.67   0.40   0.01    0.46    0.00      0.25
FAST   Worst   0.00      0.56   0.16   0.28    0.28    0.00      0.51

We then performed the non-parametric Friedman test over
all parameter configurations to assess performance differences
between the models. We found that there is a statistically
significant difference in the performance of the models
($\chi^2(3)$ = 13'065, $p < 0.0001$). A post-hoc analysis
using Scheffé’s S procedure [31] shows all model differences
to be significant at $p < 0.0001$, with mean ranks of
1.7156, 2.3017, 2.6898 and 3.2929 for FAST, LFKT, BKT,
and IRT, respectively. FAST therefore significantly outperforms
the other methods on our synthetic data sets. In [19],
IRT did not perform significantly worse than LFKT and FAST
on four different data sets. The good performance of IRT 
was attributed to the deterministic item ordering that al- 
lows IRT to infer knowledge acquisition confounded with 
item difficulty. Our results support this hypothesis as in our 
synthetic data set the items are in random order and IRT 
exhibits the worst overall performance. 
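The ranking step behind the Friedman test can be sketched in Python as follows. Each row holds the RMSE values of the competing models for one parameter configuration; the models are ranked within each row, and the statistic measures how far the mean ranks deviate from their expected value under the null hypothesis. The sketch uses the textbook statistic without tie correction and omits the post-hoc procedure; it is an illustration and not the analysis code used in this work.

```python
def rank_row(values):
    """Ranks 1..k within one row (lower value = better rank), ties averaged."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2.0 + 1.0
        for j in order[pos:end + 1]:
            ranks[j] = average
        pos = end + 1
    return ranks


def friedman(rmse_table):
    """Return the mean rank per model and the Friedman chi-square statistic."""
    n = len(rmse_table)
    k = len(rmse_table[0])
    sums = [0.0] * k
    for row in rmse_table:
        for j, r in enumerate(rank_row(row)):
            sums[j] += r
    mean_ranks = [s / n for s in sums]
    expected = (k + 1) / 2.0
    statistic = 12.0 * n / (k * (k + 1)) * sum((m - expected) ** 2 for m in mean_ranks)
    return mean_ranks, statistic
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom, which gives the three degrees of freedom reported above for four models.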

Parameter Space Investigation. To gain a better under- 
standing of the performance characteristics of the different 
models, we analyzed their performances across the parameter
space. For every pair of parameters $p_i$ and $p_j$, we divided
the parameter configurations into bins with similar values
for $p_i$ and $p_j$. We used five bins for each parameter ($p_i$
and $p_j$), resulting in a total of 25 bins. Performance of each
model was assessed by calculating the mean RMSE for each
bin. Significance of the observed performance differences
was assessed using the Friedman test at $p < 0.05$.
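The binning step can be sketched in Python as follows. The sketch assumes equal-width bins over a given value range for each of the two parameters and a list of dictionaries holding the parameter values and the RMSE of one model; both the binning rule and the data layout are assumptions of this sketch.

```python
def bin_index(value, low, high, n_bins=5):
    """Index of the equal-width bin that contains value, clipped to the range."""
    if high <= low:
        return 0
    index = int((value - low) / (high - low) * n_bins)
    return min(max(index, 0), n_bins - 1)


def mean_rmse_per_bin(configs, x_key, y_key, x_range, y_range, n_bins=5):
    """Mean RMSE for each occupied cell of the n_bins x n_bins grid."""
    sums = {}
    for config in configs:
        cell = (bin_index(config[x_key], *x_range, n_bins),
                bin_index(config[y_key], *y_range, n_bins))
        total, count = sums.get(cell, (0.0, 0))
        sums[cell] = (total + config["rmse"], count + 1)
    return {cell: total / count for cell, (total, count) in sums.items()}
```

Running this once per model and comparing the per-cell means gives the winning model for each bin; the significance check per bin is not included here.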

Figure 2a shows the relative performance of the best model 
for each parameter pair. The models are color-coded: BKT 
is shown in red, IRT in green, LFKT in yellow, and FAST in 
blue. The color gradient indicates the relative improvement 
of the winning model over the second best model, where 
darker colors indicate higher values. White-colored areas 
indicate that there is no significant difference between the 
models. The plot shows that FAST is robust to parameter 
variations and outperforms the other models in large parts of 
the parameter space. In parts with low feature weights, i.e.,
where the feature $f$ shows only a low correlation with task
outcomes, LFKT outperforms FAST. When the variance $\delta$
of item difficulties $d_i$ is low, BKT is the best model. A
low variance in $d_i$ implies a good skill model, with all tasks
having approximately the same difficulty. 

(a) Relative improvement in task outcome prediction (RMSE).
(b) Relative improvement in knowledge state prediction (RMSE).

Figure 2: Best performing models (RMSE) regarding prediction of task outcomes (a) and knowledge state
prediction (b). The color for each bin indicates the best performing model, averaged over all other parameters.
We investigated BKT (red), IRT (green), LFKT (yellow), and FAST (blue). White-colored bins exhibit no
significant difference in model performance. The color brightness indicates the relative improvement of the
best performing model over competing models, with dark colors referring to higher values. FAST is robust to
parameter variations and outperforms the other models in large parts of the parameter space when predicting
task outcomes (a). BKT is the best model if the variance of the item difficulty is low (a). BKT is superior
to the other models in large parts of the parameter space when predicting knowledge states (b).

In contrast to Figure 2a, where we assessed the prediction
of task outcomes, we analyzed the quality of the prediction
of knowledge states $k_{n,t}$ using the RMSE in Figure 2b.
Ultimately, we want to predict whether a student has mastered a
skill or not [26, 3]. The plot uses the same parameter pairs
and color codings as Figure 2a. Interestingly, LFKT and
FAST are not superior to BKT when it comes to prediction
of the latent state. The additional parameters that LFKT
and FAST use have a direct influence on the predicted task
outcomes and therefore improve performance when predicting
task outcomes. They have, however, no direct influence
on the latent state $k_{n,t}$ of the model.

Robustness. Next, we tested the robustness of the dif- 
ferent models against each other. We generated ideal data 
(meeting the model assumptions) for all the models and then 
interpolated the parameter values between these ideal cases. 
The classes of data sets that meet the model assumptions 
for the four models are described in Section 3. From every 
class of data sets, we selected the extreme case with the least 
amount of noise. In the following, we describe these cases. 

For BKT, data is generated using $\delta = \omega = 0$, assuming
a perfect skill model (all tasks with the same difficulty) and
setting the influence of additional (not captured) features
to 0. Furthermore, we removed the randomness by setting
$p_G = p_S = 0$. For IRT, the extreme case data was generated
using $p_G = p_S = 0.5$, $\omega = 0$ and by additionally setting $\delta = 3$.
As LFKT is a combination of IRT and BKT, we set the
parameters to $p_G = p_S = 0.25$ and $\delta = 1.5$. Furthermore, we
set $\omega = 0$, again assuming no influence of features that are
not captured. For FAST we used the same parameters as for LFKT,
but additionally introduced a feature influence by setting
$\omega = 1.5$. We linearly interpolated the parameter space
in between these extreme cases to assess model robustness when
model assumptions are violated. Figure 3 displays the model
with the best RMSE in this subspace that contains the extreme
(ideal) cases, where $p_L$ and $p_I$ are averaged over the BKT
parameter clusters presented in [30]. From these results, we
can see that BKT tends to be robust to increased feature
influence as long as $p_G, p_S < 0.15$. If the feature weight
$\omega > 0.75$, FAST outperforms all the other models. For
large differences in item difficulties and large guess and slip
probabilities, LFKT has a slight advantage over IRT.
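The interpolation between the ideal cases can be sketched in Python as follows. The four dictionaries hold the extreme-case values stated above; the dictionary keys, the helper name and the use of a single interpolation factor `b` between two configurations are assumptions of this sketch, since the exact traversal of the subspace is not spelled out in the text.

```python
BKT_IDEAL = {"delta": 0.0, "omega": 0.0, "p_guess": 0.0, "p_slip": 0.0}
IRT_IDEAL = {"delta": 3.0, "omega": 0.0, "p_guess": 0.5, "p_slip": 0.5}
LFKT_IDEAL = {"delta": 1.5, "omega": 0.0, "p_guess": 0.25, "p_slip": 0.25}
FAST_IDEAL = {"delta": 1.5, "omega": 1.5, "p_guess": 0.25, "p_slip": 0.25}


def interpolate(config_a, config_b, b):
    """Linear interpolation between two configurations for b in [0, 1]."""
    return {name: (1.0 - b) * config_a[name] + b * config_b[name]
            for name in config_a}
```

With these values, the LFKT case is exactly the midpoint between the BKT and IRT cases, and moving from the LFKT case to the FAST case only increases the feature weight.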

5.3 Parameter Influence 

To analyze the influence of the model parameters on the per- 
formance of the student models, we used linear regression to 
predict the RMSE based on the parameters of the sampling 
model. This allowed us to identify statistically significant 
correlations between the sampling parameters and the per- 
formance of the models despite the high dimensionality of 
the parameter space. 

The sampling parameters have a direct influence on the ra- 
tio of correct observations in the data, e.g., a high learning 
probability with low guess and slip parameters leads to a 
high ratio of correct observations. Further, if the parame- 
ters model fast learners then the average number of tasks 
tends to be low since we are simulating a mastery learn- 
ing environment. The three models IRT, LFKT and FAST 
which explicitly model items are sensitive to this kind of 
lacking data, as by having fewer observed items per student 
the estimation of item difficulty becomes more difficult. To
investigate the effect of both factors, we added the two
variables correct ratio and average number of tasks as
predictors to the regression model. In order to make the
regression coefficients comparable, all sampling parameters have
been normalized to have mean 0 and standard deviation 1.

[Figure 3 plot, titled "Best performing model under breaking assumptions"; axis labels: b = 0 to b = 1, with $\omega = b \cdot 1.5$.]

Figure 3: Relative model performance on ideal data
sets generated by linearly interpolating between
parameters. The colors refer to the models BKT (red),
IRT (green), LFKT (yellow) and FAST (blue). The
color gradient indicates the relative performance as
in Figure 2a. BKT and FAST are more robust to
the invalid assumptions of our experiment than IRT
and LFKT.

Figure 4 shows the regression coefficients for all four models, 
with red and green denoting statistically significant and not 
significant coefficients, respectively. The variables correct 
ratio and average number of tasks have the largest influence 
on the RMSE. Both effects are significant and positive (reducing
the RMSE). A larger range of item difficulties $\delta$ has a
positive influence on the performance of all models except for 
the BKT model. This is expected as BKT does not account 
for variations in item difficulty and thus larger variations in 
item difficulties are treated as noise by BKT, which makes 
prediction harder. IRT, LFKT and FAST, on the other 
hand, benefit from larger variations. We assume that this is 
due to the better identifiability of the effects of the different 
items. Interestingly, increasing the feature weight $\omega$ has no
significant negative effect for the models that do not take 
features into account (BKT, IRT, LFKT), but has a posi- 
tive effect for FAST. The initial probability and the learning 
probability have a small negative and small positive effect 
on performance, respectively. While these coefficients are 
partially significant they have very small magnitude. The 
positive effect of the slip probability $p_S$ for all models
except BKT (where the effect is not significant) is rather surprising.
However, the effect of a high slip probability in our sampling
model is that it weakens the influence of the latent knowledge
state on the task outcomes. This could explain the
positive influence for models that estimate item difficulty,
since the difficulty estimates are less confounded with effects
from the knowledge state. Further work is needed to confirm
this effect.
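The regression used in this section can be sketched in Python as follows. Predictors are standardized to mean 0 and standard deviation 1, and the coefficients are obtained by ordinary least squares through the normal equations. Significance testing is omitted, and the data layout, the function names and the small Gauss-Jordan solver are assumptions of this sketch.

```python
import math


def standardize(column):
    """Scale a list of values to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear_regression(predictors, target):
    """predictors maps a name to a list of values; returns (intercept, coefficients)."""
    names = list(predictors)
    columns = [[1.0] * len(target)] + [standardize(predictors[name]) for name in names]
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    moment = [sum(a * y for a, y in zip(ci, target)) for ci in columns]
    beta = solve(gram, moment)
    return beta[0], dict(zip(names, beta[1:]))
```

Because the predictors share a common scale, the magnitudes of the returned coefficients can be compared directly; a negative coefficient means that increasing the predictor lowers the RMSE.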
[Figure 4 plots: four panels titled "Regressing RMSE of BKT", "Regressing RMSE of IRT", "Regressing RMSE of LFKT" and "Regressing RMSE of FAST". Each panel shows the coefficient (horizontal axis, ticks from -0.2 to 0.2) for the predictors averageNumberOfTasks, correctRatio, initial ($p_I$), feature weight ($\omega$), guess ($p_G$), slip ($p_S$), learn ($p_L$) and difficulty range ($\delta$).]

Figure 4: Regression coefficients to predict RMSE based on the sampling parameter values for the models
BKT, IRT, LFKT and FAST. Parameters with positive coefficients have a negative effect on the performance
and vice versa. Red denotes significant coefficients with p < 0.001, green coefficients are not significant.


6. CONCLUSIONS 

In this work, we investigated the performance characteristics 
of latent factor and knowledge tracing models by exploring 
their parameter space. To do so, we generated 66’000
synthetic data sets for different parameter configurations,
each containing data for 1’500 students. Synthetic
data allowed us to study the model performances under dif- 
ferent parameter settings, and to test the robustness of the 
models against violations of specific model assumptions. 

We showed best and worst case performances for all the 
models and investigated the relative performance gain in 
various regions of the parameter space. Our results showed 
that the two recently developed models LFKT and FAST, 
which synthesize item response theory and knowledge trac- 
ing, perform better than BKT and IRT. FAST even signif- 
icantly outperformed LFKT if reasonable features can be 
extracted from the learning environment. Interestingly, IRT 
exhibited the worst performance, which supports the hy- 
pothesis by [19] that random item ordering has a negative 
influence on the performance of IRT models. However, more 
analyses are needed to investigate this effect thoroughly. 
Further, we investigated the models’ abilities to predict the 
latent knowledge state and demonstrated that LFKT and 
FAST are outperformed by BKT. This raises the question 
of how to adjust the two recent methods LFKT and FAST if 
the aim is to predict knowledge states; we leave this explo- 
ration for future work. The analysis of the model robustness 
revealed that BKT is robust to increased feature influence 
for small guess and slip probabilities. For larger guess and 
slip, FAST outperformed the other methods. 


While all sampling parameters have been carefully chosen 
to match real world conditions, we expect real world data 
to exhibit more noise and additional effects not covered by 
our synthetic data. Thus, the achieved performance can be 
considered an upper bound on the performance achievable 
in real world settings. The performance of BKT depends 
on the quality of the underlying skill model. We have simu- 
lated imperfect skill models by introducing item effects, but 
we did not take other sources for imperfect skill models into 
account. Furthermore, the simulated data consisted of a 
fixed set of items. For tutoring systems offering many varia- 
tions of tasks, reliable estimation of item effects is challeng- 
ing, which in turn influences the performance of IRT, LFKT 
and FAST. Moreover, the performance of FAST is driven by 
feature quality, which may vary between different tutoring 
systems. 

Finally, it remains questionable whether and how the perfor- 
mance of the investigated techniques influences the learning 
outcome of students in a tutoring system. We show rela- 
tive improvements in RMSE between models of up to 6%. 
However, whether small-scale improvements in the accuracy
of student models affect the learning outcome remains
a matter of debate [4, 39].